Everything in New York is expensive. For first-time travelers, New York may seem even more expensive. At the same time, travelers have different wants and needs from their accommodation than a student or a working person would. So I wanted to analyze the price trends of Airbnb listings in New York through the eyes of a traveler.
Travelers of different budgets and purposes have different priorities, but most would definitely prefer good accessibility to the top tourist attractions they want to visit. Will this have an effect on Airbnb rental prices?
For this data analysis, I used the Airbnb open data available here. I used the listings.csv file for New York.
Since the csv file contained more than 20,000 entries, I decided to do some basic scrubbing first and then export the result to a different csv using the csv library. I then used the pandas library to manipulate and display selected data, and the matplotlib and seaborn libraries for visualization. To calculate the average distance from each listing to the top-rated tourist attractions of New York, I used the Beautiful Soup library to parse a TripAdvisor page and retrieve a list of attraction names. I then used the Google Places API to get each attraction's latitude and longitude, so that I could calculate the great-circle distance from each Airbnb apartment.
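Before diving in, here is a minimal sketch of the distance calculation at the core of the analysis: geopy's great_circle computes the great-circle distance between two (latitude, longitude) tuples. The coordinates below are approximate, illustrative values for two well-known Manhattan spots, not values taken from the dataset.
In [ ]:
from geopy.distance import great_circle

# illustrative coordinates (approximate): Central Park and Times Square
central_park = (40.7829, -73.9654)
times_square = (40.7580, -73.9855)

# great-circle distance in kilometers, rounded the same way as in clean_csv() below
dist_km = round(great_circle(central_park, times_square).kilometers, 2)
print(dist_km)  # roughly 3 km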
In [3]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
import seaborn as sns
import statistics
import csv
from scipy import stats
from bs4 import BeautifulSoup as bs
import urllib.request
from googleplaces import GooglePlaces, types, lang
from geopy.distance import great_circle
import geocoder
%matplotlib inline
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
In [4]:
apikey = 'YOUR_GOOGLE_PLACES_API_KEY'  # use your own key here; never publish a real API key
gplaces = GooglePlaces(apikey)
This function takes two parameters: the URL of the TripAdvisor page and the number of top attractions one wants to check. It uses the Beautiful Soup library to find the div that contains the list of top-rated tourist attractions in the city and returns them as a list.
In [5]:
def tripadvisor_attractions(url, how_many):
    page = urllib.request.urlopen(url)
    # use Beautiful Soup to select the targeted div
    soup = bs(page.read(), "lxml")
    filtered = soup.find("div", {"id": "FILTERED_LIST"})
    top_list = filtered.find_all("div", class_="property_title")
    sites = []
    # save the text within each hyperlink into the list
    for site in top_list:
        site = (site.a).text
        site = str(site)
        if not any(char.isdigit() for char in site):
            sites.append(site)
    # slice the list down to however many places the user wants to include
    sites = sites[:how_many]
    return sites
This function takes the list returned by the tripadvisor_attractions() function as well as the city name as a string. I explicitly ask for the city name so that the Google Places API finds more accurate place details when it looks up each tourist attraction. It returns a dataframe of each tourist attraction, its Google place ID, longitude, and latitude.
In [6]:
# ta is short for tourist attraction
def ta_detail(ta_list, city):
    ta_df = pd.DataFrame({'Tourist Attraction': '',
                          'place_id': '',
                          'longitude': '',
                          'latitude': ''},
                         index=range(len(ta_list)))
    for i in range(len(ta_list)):
        query_result = gplaces.nearby_search(
            location=city,
            keyword=ta_list[i],
            radius=20000)
        # keep only the first (top) result of each query
        query = query_result.places[0]
        ta_df.loc[i, 'Tourist Attraction'] = query.name
        ta_df.loc[i, 'longitude'] = query.geo_location['lng']
        ta_df.loc[i, 'latitude'] = query.geo_location['lat']
        ta_df.loc[i, 'place_id'] = query.place_id
    return ta_df
In [7]:
def latlong_tuple(ta_df):
    tuple_list = []
    for j, ta in ta_df.iterrows():
        ta_geo = (float(ta['latitude']), float(ta['longitude']))
        tuple_list.append(ta_geo)
    return tuple_list
This function is the main data-scrubbing function. I first tried importing the csv as a dataframe and then cleaning each entry, but pandas iterrows() and itertuples() took a very long time, so I decided to do the basic scrubbing while importing the csv. The function automatically saves a new copy of the cleaned csv with an _out.csv suffix; the function itself doesn't return anything.
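As an aside, here is a minimal sketch of the row-iteration cost mentioned above, using a small synthetic frame rather than the real listings data; actual timings will vary by machine and data size.
In [ ]:
import time
import pandas as pd

# synthetic stand-in for the listings frame
sample = pd.DataFrame({'lat': [40.75] * 100000, 'lng': [-73.98] * 100000})

start = time.time()
for _, r in sample.iterrows():            # row-wise iteration builds a Series per row
    _ = (r['lat'], r['lng'])
print('iterrows:   %.2f s' % (time.time() - start))

start = time.time()
for r in sample.itertuples(index=False):  # namedtuples are much cheaper per row
    _ = (r.lat, r.lng)
print('itertuples: %.2f s' % (time.time() - start))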
In [8]:
def clean_csv(data_in, geo_tuples):
    # automatically generates a cleaned csv file with the same name plus an _out.csv suffix
    index = data_in.find('.csv')
    data_out = data_in[:index] + '_out' + data_in[index:]
    # some error checking when opening
    try:
        s = open(data_in, 'r')
    except:
        print('File not found or cannot be opened')
    else:
        t = open(data_out, 'w', newline='')  # newline='' is the documented idiom for csv.writer
        print('\n Output from an iterable object created from the csv file')
        reader = csv.reader(s)
        writer = csv.writer(t, delimiter=',')
        # counters for the number of rows kept and removed during filtering
        removed = 0
        added = 0
        header = True
        for row in reader:
            if header:
                header = False
                for i in range(len(row)):
                    # save the indices of the latitude and longitude columns
                    if row[i] == 'latitude':
                        lat = i
                    elif row[i] == 'longitude':
                        lng = i
                row.append('avg_dist')
                writer.writerow(row)
            # only keep the listing if its last column (the review count in this dataset) is greater than 7
            elif (int(row[-1]) > 7):
                # create a geo tuple for easy distance calculation later on
                tlat = float(row[lat])
                tlng = float(row[lng])
                ttuple = (tlat, tlng)
                dist_calc = []
                # calculate the distance from the listing to every saved top tourist attraction;
                # if a distance is somehow greater than 100 km, skip it so it doesn't skew the result
                for i in geo_tuples:
                    dist_from_spot = round(great_circle(i, ttuple).kilometers, 2)
                    if (dist_from_spot < 100):
                        dist_calc.append(dist_from_spot)
                    else:
                        print('Attraction at', i, 'is too far.')
                # average distance between the listing and all of the tourist attractions
                avg_dist = round(statistics.mean(dist_calc), 3)
                row.append(avg_dist)
                writer.writerow(row)
                added += 1
            else:
                removed += 1
        s.close()
        t.close()
        print('Function Finished')
        print(added, 'listings saved')
        print(removed, 'listings removed')
In [9]:
url = "https://www.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html"
top_10 = tripadvisor_attractions(url, 10)
In [10]:
print(top_10)
In [11]:
ta_df = ta_detail(top_10, 'New York, NY')
geo_tuples = latlong_tuple(ta_df)
ta_df
Out[11]:
The cell below reads in the original csv file, removes some unwanted listings, and adds a new column holding the average distance from the top 10 TripAdvisor-approved (!!) tourist attractions.
In [22]:
clean_csv("data/listings.csv", geo_tuples)
We then make a copy of the dataframe, listing, to play around with.
In [24]:
df = pd.read_csv('data/listings_out.csv')
print('Dimensions:', df.shape)
df.head()
listing = df.copy()
In [25]:
listing.head()
Out[25]:
In [43]:
area = listing.groupby('neighbourhood_group')
nbhood_price = area['price'].agg([np.sum, np.mean, np.std])
nbhood_dist = area['avg_dist'].agg([np.sum, np.mean, np.std])
In [44]:
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
fig.suptitle('NY Neighbourhoods: Price vs Average Distance to Top Spots', fontsize=10, fontweight='bold')
nbhood_price['mean'].plot(kind='bar', ax=ax[0], color='mediumslateblue')
nbhood_dist['mean'].plot(kind='bar', ax=ax[1], color = 'orchid')
ax[0].set_ylabel('Price', fontsize=10)
ax[1].set_ylabel('Average Distance', fontsize=10)
Out[44]:
Then I used the groupby function on neighbourhood to see a price comparison between different New York neighbourhoods.
In [45]:
area2 = listing.groupby('neighbourhood')
nb_price = area2['price'].agg([np.sum, np.mean, np.std]).sort_values(['mean'])
nb_dist = area2['avg_dist'].agg([np.sum, np.mean, np.std])
In [46]:
fig, ax = plt.subplots(figsize=(4, 35))
fig.suptitle('Most Expensive Neighbourhoods on Airbnb', fontsize=10, fontweight='bold')
nb_price['mean'].plot(kind='barh', ax=ax, color='salmon')
Out[46]:
In [47]:
breezy = listing.loc[listing['neighbourhood'] == 'Breezy Point']
breezy
Out[47]:
The second most expensive neighbourhood is also not in Manhattan, in contrast to the first visualization we did, which showed that Manhattan has the highest average Airbnb price. All apartments in Manhattan Beach turn out to be reasonably priced except "Manhattan Beach for summer rent", which costs 2,800 USD per night.
It seems that outliers are skewing the data quite significantly.
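One quick way to see how much the outliers pull on the averages (a sketch against the listing frame built above): compare the mean price with the median, which is robust to extreme listings.
In [ ]:
# mean vs median price per neighbourhood group; a large gap hints at outliers
skew_check = listing.groupby('neighbourhood_group')['price'].agg(['mean', 'median'])
skew_check['mean_minus_median'] = skew_check['mean'] - skew_check['median']
skew_check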
In [48]:
beach = listing.loc[listing['neighbourhood'] == 'Manhattan Beach']
beach
Out[48]:
In [49]:
area = listing.groupby('room_type')
room_price = area['price'].agg([np.sum, np.mean, np.std])
room_dist = area['avg_dist'].agg([np.sum, np.mean, np.std])
In [50]:
room_price['mean'].plot(title="Average Price by Room Type")
Out[50]:
In [51]:
apt = listing.loc[listing['room_type'] == 'Entire home/apt']
apt = apt.sort_values('price', ascending=False)
apt.drop(apt.head(20).index, inplace=True)
apt.head()
Out[51]:
In [52]:
sns.jointplot(x='avg_dist', y="price", data=apt, kind='kde')
Out[52]:
Plotting the Entire home/apt listings without the top 20 most expensive ones shows two concentrated areas where average distance and price correlate. The bimodal distribution in average distance might reflect the concentration of Airbnb listings in Manhattan and Brooklyn.
In [36]:
f, ax = plt.subplots(figsize=(11, 6))
sns.violinplot(x="neighbourhood_group", y="price", data=apt, palette="Set3")
Out[36]:
Plotting a violin diagram of the prices of all entire homes in different neighbourhood groups shows that Manhattan has a more widely distributed price range, albeit on the higher end, while Queens and the Bronx have listings concentrated at a specific, lower price point.
To deal with some of the outliers at the top, I tried deleting the 10 or 20 most expensive listings, but this method wasn't very scalable across the dataset, nor was it an accurate depiction of the price variety. So I decided to first get an understanding of the most expensive listings in New York, and then to create a separate dataframe that removes entries with prices more than three standard deviations from the mean.
In [53]:
fancy = listing.sort_values('price', ascending=False).iloc[:50]
fancy.head(10)
Out[53]:
In [54]:
fancy.describe()
Out[54]:
It is likely that some of the listings above are meant specifically for events and photography rather than traveler accommodation. It also seems that some hosts who weren't available to host, but didn't want to remove their listing from Airbnb, simply set the price to 9,900 USD.
Some of the listings that seemed "normal" still had very high prices. To limit their influence, the cell below keeps only listings with more than one review and then drops entries priced more than three standard deviations from the mean:
In [55]:
reviewed = listing.loc[listing['number_of_reviews'] > 1]
# keep only prices within three standard deviations of the mean
reviewed = reviewed[((reviewed['price'] - reviewed['price'].mean()) / reviewed['price'].std()).abs() < 3]
reviewed.describe()
Out[55]:
In [56]:
fig, axs = plt.subplots(1, 2, sharey=True)
fig.suptitle('Do Reviews and Price Matter?', fontsize=20, fontweight='bold')
reviewed.plot(kind='scatter', x='reviews_per_month', y='price', ax=axs[0], figsize=(16, 8))
reviewed.plot(kind='scatter', x='avg_dist', y='price', ax=axs[1])
Out[56]:
The two plots above explore whether there is any relationship between price and the number of reviews per month (a proxy for trust and approval), and between price and the average distance from the top attractions. The reviews-per-month plot does not display any positive correlation between price and user approval, which makes sense, as many factors beyond user approval determine an apartment's rental price.
The average distance plot shows an interesting negative correlation: the lower the average distance, the higher the price tends to be.
Both graphs show that many hosts like to set prices in increments of 5 or 10 USD, as the data is heavily concentrated along the horizontal grid lines.
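To put a rough number on these visual impressions, here is a quick check using scipy (already imported above). It assumes the reviewed frame built earlier; filling missing review rates with 0 is an assumption of this sketch, not part of the original analysis.
In [ ]:
# Pearson correlation on the outlier-trimmed frame
r_dist, p_dist = stats.pearsonr(reviewed['avg_dist'], reviewed['price'])
r_rev, p_rev = stats.pearsonr(reviewed['reviews_per_month'].fillna(0), reviewed['price'])
print('avg_dist vs price:          r = %.3f (p = %.3g)' % (r_dist, p_dist))
print('reviews_per_month vs price: r = %.3f (p = %.3g)' % (r_rev, p_rev))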
In [57]:
f, ax = plt.subplots(figsize=(11, 5))
sns.boxplot(x="neighbourhood_group", y="price", hue="room_type", data=reviewed, palette="PRGn")
Out[57]:
The box plot above shows how large the discrepancy in Manhattan apartment prices is. The top 25% of apartments in Manhattan range in price from 400 USD to more than 700 USD, while those in the Bronx span a range of just 200 to 300 USD.
In [58]:
reviewed2 = reviewed[((reviewed['price'] - reviewed['price'].mean()) / reviewed['price'].std()).abs() < 2]
sns.jointplot(x='avg_dist', y="price", data=reviewed2, kind='kde')
Out[58]:
For a better visualization of the correlation between price and average distance, I plotted another graph with only about 95% of the dataset (i.e., listings within two standard deviations of the mean). This joint plot shows two highly concentrated clusters: apartments around 5 km from the top tourist attractions on average, at around 90-100 USD per night, and apartments around 8 km away at 50-60 USD per night.
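As a sanity check on the "about 95%" figure, the two-standard-deviation filter should keep roughly that share of rows (a one-liner against the frames defined above):
In [ ]:
# fraction of listings kept by the 2-sigma price filter
print('kept %.1f%% of reviewed listings' % (100 * len(reviewed2) / len(reviewed)))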
By looking at several visualizations of the Airbnb data in New York, I was able to find some negative correlation between price per night and the average distance from the most famous sights. Data grouped by neighbourhood yielded the expected result: the highest average price per night in Manhattan and the lowest in the Bronx and Queens. The listings.csv data I used contained only a summary of the data, so it was not easy to analyze the detailed factors that determine the price. Moving forward, however, I would love to analyze the detailed version of the open data to identify more accurate correlations between price and apartment size, availability, reviews, average distance, and so on.